Logistic regression determines the values of the regression coefficients that are most consistent with the observed data

using what’s called the maximum likelihood criterion. The likelihood of any statistical model is the probability (based on the

model) of obtaining the values you observed in your data. There’s a likelihood value for each row in the data set, and a total

likelihood (L) for the entire data set. The likelihood value for each data point is the predicted probability of getting the observed

outcome. For individuals who died (refer to Table 18-1), the likelihood is the probability of dying (Y) predicted by the

logistic formula. For individuals who survived, the likelihood is the predicted probability of not dying, which is 1 – Y. The total likelihood (L) for the whole set of individuals is the product of the calculated likelihoods for all the individuals.
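To make this concrete, here is a minimal Python sketch of the likelihood calculation; the data values and coefficients are made up for illustration and are not the Table 18-1 numbers:

```python
import math

def predicted_prob(dose, a, b):
    """Logistic formula: predicted probability of dying at a given dose."""
    return 1.0 / (1.0 + math.exp(-(a + b * dose)))

# Made-up data: 1 = died, 0 = survived (not the actual Table 18-1 values)
doses = [10, 20, 30, 40, 50]
outcomes = [0, 0, 1, 1, 1]

a, b = -6.0, 0.2  # illustrative coefficient values

L = 1.0
for dose, y in zip(doses, outcomes):
    p = predicted_prob(dose, a, b)
    # Each row's likelihood is p if the person died, 1 - p if they survived
    L *= p if y == 1 else 1.0 - p

print(L)  # total likelihood for the whole data set
```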

To find the values of the coefficients that maximize L, it is most practical to find the values that minimize the quantity –2 multiplied by the natural logarithm of L, which is also called the –2 log likelihood and abbreviated –2LL. (Because –2 ln L gets smaller as L gets larger, minimizing one is the same as maximizing the other.) Statisticians also call –2LL the deviance. The closer the curve designated by the regression formula comes to the observed points, the smaller this

deviance value will be. The actual value of the deviance for a logistic regression model doesn’t mean much by itself. It’s the

difference in deviance between two models you might be comparing that is important.
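Continuing the illustration in Python, the deviance is –2 times the natural log of L. In practice it is computed as a sum of logs, because multiplying many small probabilities together can underflow to zero; the per-row likelihood values below are made up:

```python
import math

# Made-up per-row likelihoods (each row's predicted probability of its observed outcome)
row_likelihoods = [0.85, 0.70, 0.60, 0.90, 0.75]

# Summing logs is equivalent to -2 * log(product) but numerically safer
deviance = -2.0 * sum(math.log(p) for p in row_likelihoods)
print(deviance)  # smaller deviance means a closer fit

# What matters is the difference in deviance between two models being compared
deviance_other = 4.8  # hypothetical deviance of a competing model
print(deviance - deviance_other)
```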

Once deviance is calculated, the final step is to identify the values of the coefficients that will minimize the deviance of the

observed Y values from the fitted logistic curve. This may sound challenging, but statistical programs employ elegant and

efficient ways to minimize such a complicated function of several variables, and use these methods to obtain the coefficients.
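Real statistical packages use specialized fitting algorithms (such as iteratively reweighted least squares), but the idea can be sketched by handing the deviance function to a general-purpose minimizer. This Python sketch uses scipy.optimize.minimize on made-up data; the variable names and values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data: one predictor and a binary outcome (1 = event, 0 = no event)
x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)

def deviance(coefs):
    """-2 log likelihood of the logistic model for the given coefficients."""
    a, b = coefs
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))  # predicted probabilities
    # Log likelihood: log(p) where y = 1, log(1 - p) where y = 0
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(deviance, x0=[0.0, 0.0])  # search starting from a = b = 0
print(result.x)   # the fitted coefficients a and b
print(result.fun) # the minimized deviance
```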

Handling multiple predictors in your logistic model

The data in Table 18-1 have only one predictor variable, but you may have several predictors of a

binary outcome. If the data in Table 18-1 were about humans, you would assume the chance of dying

from radiation exposure may depend not only on the radiation dose received, but also on age, gender,

weight, general health, radiation wavelength, and the amount of time over which the person was

exposed to radiation. In Chapter 17, we describe how the straight-line regression model can be

generalized to handle multiple predictors. You can generalize the logistic formula to handle multiple

predictors in the same way.

Suppose that the outcome variable Y is dependent on three predictors called X, V, and W. Then the

multivariate logistic model looks like this:

Y = 1 / (1 + e^–(a + bX + cV + dW))

Logistic regression finds the best-fitting values of the parameters a, b, c, and d given your data. That

way, for any particular set of values for X, V, and W, you can use the equation to predict Y, which is the

probability of being positive for the outcome.
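As a minimal Python sketch of that formula (the coefficient and predictor values here are made up, not fitted from any data):

```python
import math

def predicted_prob(x, v, w, a, b, c, d):
    """Multivariate logistic model: predicted probability of a positive outcome."""
    return 1.0 / (1.0 + math.exp(-(a + b * x + c * v + d * w)))

# Made-up coefficients and one subject's predictor values
print(predicted_prob(x=30, v=1, w=65, a=-4.0, b=0.05, c=0.8, d=0.02))
```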

Running a Logistic Regression Model with Software

The theory behind logistic regression is difficult to grasp, and the calculations are complicated

(see the sidebar “Getting into the nitty-gritty of logistic regression” for details). The good news is

that most statistical software (as described in Chapter 4) can run a logistic regression model, and doing so is similar to running a straight-line or multiple linear regression model (see Chapters 16 and

17). Here are the steps:

1. Make sure your data set has a column for the outcome variable that is coded as 1 where the